
Improve transformer_squad reader to avoid duplicate tokenization of context in training #263

Merged
merged 2 commits into allenai:main from MagiaSN:transformer_squad_dev on May 17, 2021

Conversation

@MagiaSN (Contributor) commented on May 15, 2021

In training, the same context is used in multiple instances, so this pull request adds cached_tokenized_context to avoid duplicate tokenization of the context. This reduces the preprocessing time from 30m49s to 13m35s on my machine, and yields exactly the same dev results as the original implementation.
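
The idea is to reuse the tokenized context across consecutive instances that share the same paragraph, so the tokenizer runs once per paragraph instead of once per question (SQuAD groups questions under their paragraph, so shared contexts arrive back to back). Below is a minimal sketch of that caching pattern, assuming a tokenizer object with a `tokenize` method; the class and attribute names are illustrative, not the exact identifiers used in this PR:

```python
from typing import List, Optional


class TransformerSquadReaderSketch:
    """Illustrative reader fragment showing the context-tokenization cache."""

    def __init__(self, tokenizer) -> None:
        self._tokenizer = tokenizer
        # Cache the most recently tokenized context. A single entry is enough
        # because instances that share a context are read consecutively.
        self._cached_context: Optional[str] = None
        self._cached_context_tokens: Optional[List[str]] = None

    def _tokenize_context(self, context: str) -> List[str]:
        # Only re-tokenize when the context string actually changes.
        if context != self._cached_context:
            self._cached_context = context
            self._cached_context_tokens = self._tokenizer.tokenize(context)
        return self._cached_context_tokens
```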

@epwalsh (Member) left a comment


This is a great improvement! Thanks @MagiaSN 🙂

@epwalsh epwalsh merged commit dea182c into allenai:main May 17, 2021
@MagiaSN MagiaSN deleted the transformer_squad_dev branch May 18, 2021 04:18